External review synthesis + v1 release roadmap#61
Conversation
Land the Gemini and ChatGPT external-review outputs (two iterations each, six files) targeting the v0.1.0-alpha bundles, plus a structured synthesis under docs/external_review/summaries/ and an integrated v1 release roadmap under docs/release/. Critical finding from review (verified locally by ChatGPT v2): student_public bundles reconstruct converted_within_90_days end-to-end through public relational tables — opportunities.close_outcome plus customers/subscriptions existence recover the target via joins. CLAUDE.md gains a hard constraint forbidding this; the v1 release roadmap's Phase 2 is the structural fix. This PR is planning artifacts only — no code changes. The roadmap defines 7 phases ending with publishing leadforge-lead-scoring-v1 to Kaggle and Hugging Face. Implementation begins in follow-up PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
This PR adds planning and reference documentation for moving from the current alpha bundles toward a public v1 dataset release (leadforge-lead-scoring-v1). It lands the raw external review artifacts (Gemini + ChatGPT), a Claude-authored synthesis/triage of those findings, and a phased v1 release roadmap with an explicit “v1-ready” acceptance-gate contract; it also updates project guidance (CLAUDE.md) and the agent execution plan (.agent-plan.md) to reflect the newly identified relational-leakage blocker.
Changes:
- Add v1 release planning docs: roadmap, design decisions, and acceptance gates (plus a post-v1 backlog).
- Add external review corpus + structured summaries/synthesis/triage under
docs/external_review/. - Update constraints and planning trackers (
CLAUDE.md,.agent-plan.md) to reflect the relational-leakage finding and v1 workstream.
Reviewed changes
Copilot reviewed 22 out of 24 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/release/v1_release_roadmap.md | Defines the 7-phase v1 release plan and operational “v1-ready” criteria. |
| docs/release/v1_release_design.md | Captures v1-specific architectural decisions (snapshot-safe relational export, release validation/reporting, packaging, critique loop). |
| docs/release/v1_acceptance_gates.md | Specifies machine-checkable gates/criteria intended to define “ready to publish.” |
| docs/release/post_v1_roadmap.md | Collects deferred items and longer-horizon improvements after v1 ships. |
| docs/external_review/summaries/recommendations_pass.md | Triage of 22 findings with action codes and v1 vs post-v1 targeting. |
| docs/external_review/summaries/README.md | Index of the external review corpus and synthesized outputs. |
| docs/external_review/summaries/key_findings.md | Action-prioritized synthesis of review findings (critical→low). |
| docs/external_review/summaries/gemini_v2_summary.md | Summary of Gemini v2 review and unique contributions. |
| docs/external_review/summaries/gemini_v1_summary.md | Summary of Gemini v1 review and unique contributions. |
| docs/external_review/summaries/cross_source_takeaways.md | Cross-source theme consolidation and divergence map. |
| docs/external_review/summaries/chatgpt_v2_summary.md | Summary of ChatGPT v2 forensic review and proposed roadmap shape. |
| docs/external_review/summaries/chatgpt_v1_summary.md | Summary of ChatGPT v1 (historical / superseded). |
| docs/external_review/summaries/chatgpt_v1_critique_summary.md | Summary of ChatGPT’s critique of its v1 report (methodology corrections). |
| docs/external_review/summaries/chatgpt_guidance_summary.md | Summary of the “second attempt guidance” methodology spec. |
| docs/external_review/gemini/gemini_report_v2.md | Raw Gemini v2 external review artifact. |
| docs/external_review/gemini/gemini_report_v1.md | Raw Gemini v1 external review artifact. |
| docs/external_review/gemini/.gitkeep | Keeps the Gemini review directory tracked. |
| docs/external_review/chatgpt/leadforge_report_v1_critique.md | Raw ChatGPT critique artifact (methodology + corrections). |
| docs/external_review/chatgpt/chatgpt_report_v2.md | Raw ChatGPT v2 external review artifact (forensic audit + roadmap). |
| docs/external_review/chatgpt/chatgpt_report_v1.md | Raw ChatGPT v1 external review artifact (superseded). |
| docs/external_review/chatgpt/.gitkeep | Keeps the ChatGPT review directory tracked. |
| CLAUDE.md | Adds an explicit hard constraint prohibiting public relational tables that reconstruct the label via joins; links new v1 docs. |
| .agent-plan.md | Adds “Next Up — leadforge-lead-scoring-v1” section and preserves the alpha tracker as historical context. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The original roadmap stopped at the phase level; reviewer asked for the phase → N PRs → which-files-each-PR-touches decomposition. Adds: - a "PRs" column to the phase summary table - a new "PR breakdown" section with per-phase enumeration of the 14 planned PRs (planning IDs phase.seq, not GitHub PR numbers) - per-PR labels, file lists, sizes, and intra-phase dependencies Total: 14 PRs targeting the v1.1.0 — Curated dataset v1 release milestone. Phase 3 has 3 PRs (the largest phase); Phases 1 and 4 are single-PR each. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
…one naming, CSV split contract Six accepted review findings from PR #61 Copilot review (the three "||"-prefix table claims were false positives and resolved as irrelevant): - COPILOT-1 (v1_release_design.md:37): clarify that flat task splits are emitted as Parquet at tasks/<task_id>/{train,valid,test}.parquet in the current alpha contract; CSV exports are a Phase 5 deliverable. Standardize on `valid.csv` (matches existing `valid.parquet`), drop the `validation.csv`/`valid.csv` divergence. - COPILOT-2/3/4: replace fictional `event_timestamp` with the real per-table timestamp columns (`touch_timestamp`, `session_timestamp`, `activity_timestamp`) plus opportunities' `created_at`. Updates v1_release_design.md (the design statement), v1_acceptance_gates.md (gate G4.4), and CLAUDE.md (hard constraint). Makes all three machine-checkable against the actual schema. - COPILOT-8: resolve git-tag inconsistency. v1_release_design.md:25 no longer mentions `dataset/v1`; standardizes on `leadforge-lead-scoring-v1` as both the git tag and Release name. - COPILOT-9: rename GitHub milestone #7 from "v1.1.0 — Curated dataset v1 release" (which read like a semver package version) to "dataset: leadforge-lead-scoring-v1" — explicitly dataset-scoped, no conflict with the package version decoupling principle. Three references in v1_release_roadmap.md updated to match. Three Copilot threads (COPILOT-5/6/7) claimed tables had `||` prefixes; verified false on inspection — tables use standard `|` delimiters. Resolved as not-applicable. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #61 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.Run metadata: |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 22 out of 24 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # v1 Lead-Scoring Dataset Release Roadmap | ||
|
|
||
| **Target:** Publish `leadforge-lead-scoring-v1` to Kaggle and Hugging Face as a best-in-class educational synthetic CRM dataset family. | ||
| **Source of truth:** This roadmap is derived from `docs/external_review/summaries/recommendations_pass.md` (signed off 2026-05-05). |
| | **v1 (first-release-roadmap)** | 14 | #1, #2, #3, #4, #5, #6, #7, #8 (audit), #9, #10, #11 (minimal), #12 (LLM rubric), #14, #15, #16, #19 | | ||
| | **post-v1 (after-release-roadmap)** | 8 | #8 (full encoding), #11 (full CI), #12 (quantitative), #13, #17, #18, #20 (log-normal) | |
| prefixed with `TBD` are placeholders set in Phase 3 of the v1 release | ||
| roadmap; a release candidate cannot ship until all `TBD`s are resolved. | ||
|
|
||
| This file is the operational definition of done for the v1 release. It is |
|
|
||
| ## Public/instructor diff gate | ||
|
|
||
| - **G9.1** Every public/instructor difference is intentional and listed in `release/EXPOSURE_DELTA.md`. |
| **v1.0.0 released (2026-05-02).** All milestones (M0–M13) complete. Teaching dataset series (v1–v7) approved by consumer. Package version bumped to 1.0.0 in pyproject.toml and leadforge/version.py. | ||
| **leadforge package v1.0.0 released (2026-05-02).** All framework milestones (M0–M13) complete. Teaching dataset series (v1–v7) approved by consumer. Package version bumped to 1.0.0 in pyproject.toml and leadforge/version.py. | ||
|
|
||
| **v0.1.0-alpha datasets shipped (2026-05-05).** Five bundles (intro / intermediate / advanced / intermediate_instructor / tiny_demo) in `release/`, packaged for external review by Gemini + ChatGPT (two iterations each). Synthesis lives under `docs/external_review/summaries/`. **Critical finding:** public relational tables reconstruct the target via joins (verified locally by ChatGPT v2 in `chatgpt_report_v2.md §0`). The v1 release work below addresses this plus the rest of the review consensus. |
* docs: PR 1.1 — relational leakage audit on alpha bundles
Reproduces the ChatGPT v2 finding on the actual 5000-lead alpha bundles
(`release/{intro,intermediate,advanced}/`). All three public tiers
reconstruct `converted_within_90_days` at 100% via every probed path;
the join-derived LR / HistGBM model also achieves AUC = 1.000 in 5-fold CV.
Adds:
- `scripts/probe_relational_leakage.py` — five deterministic reconstruction
paths (A direct, B opportunity won, C customer exists, D subscription
exists, E B∨C∨D) plus a bonus public-relational-only model. The
deterministic reconstruction is exposed as a pure function
`deterministic_relational_reconstruction(...)` designed to lift verbatim
into `leadforge/validation/leakage_probes.py` in PR 3.1.
- `tests/scripts/test_probe_relational_leakage.py` — unit test on a
hand-built four-lead toy bundle covering each path independently, plus
a path-A-collapses-when-label-dropped test that pre-validates Phase 2's
fix shape.
- `docs/release/v1_current_state_audit.md` — per-tier evidence table,
explicit G4.1-G4.6 verdict (all FAIL), pointer to PR 2.1 as the
structural fix.
Also reaffirms G1.1 (dataset name `leadforge-lead-scoring-v1`, already
locked via PR #61's milestone rename) in `v1_acceptance_gates.md`, and
checks off Phase 1 in `.agent-plan.md`.
This PR is evidence + docs only. The structural fix lives in PR 2.1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* review: address self-review findings on PR 1.1
Brutal self-review identified 15 issues; this commit addresses the
correctness/credibility ones. Summary:
Script (scripts/probe_relational_leakage.py):
- Assert lead_id uniqueness in deterministic_relational_reconstruction
(a validator must refuse non-unique keys, not silently misalign).
- Make customers/subscriptions optional inputs — empty frames now
accepted, so the same script runs cleanly on Phase-2-fixed bundles.
- Branch the agg dict properly — was using has_close_outcome flag
*inside* a lambda that pandas would have crashed before invoking.
- Add Phase-2-success ablation: re-run deterministic probes on a
redacted-in-process view to surface the post-fix shape *now*.
- Bonus model gains a without_close_outcome_aggregates variant —
the with-aggregates variant is trivially perfect because
any_closed_won is Path B aggregated; the without variant is the
load-bearing probe. Refactored to sklearn Pipeline (removes
string-based scaler dispatch).
- Add G4.4 snapshot-window probe directly on event tables. Empirical
result: alpha bundles PASS G4.4 (90-day horizon already bounds
timestamps). The audit doc now corrects the prior "FAIL (assumed)"
framing.
- Add --max-accuracy flag for Phase-2 CI gating (exit 2 on threshold
violation).
- Rewrite docstring to clearly distinguish four orthogonal evidence
channels.
Tests (tests/scripts/test_probe_relational_leakage.py):
- Add _binary_metrics coverage (perfect / all-negative / mixed) — the
function had divide-by-zero handling and F1 logic with zero tests.
- Add lead_id-uniqueness rejection test.
- Add CLI smoke test on release/intermediate (skip if not present).
- Add --max-accuracy gate failure test.
- Drop the synthetic edge-case lead — toy bundle is now 3 leads
cleanly covering A/C without B/D.
Audit doc (docs/release/v1_current_state_audit.md):
- Distinguish Path A (label in cleartext) from Paths B-E (joins).
- Add Phase-2 ablation table showing collapse to (1 - conv_rate).
- Report bonus model in two variants; AUC = 1.000 in BOTH variants
is a substantive new finding (non-trivial relational features
carry the leak independently of close_outcome — i.e., omitting
customers/subscriptions tables is structurally required).
- Correct G4.4 verdict to PASS (with caveat about within-horizon
content still leaking).
- Note instructor companion behavior is expected.
- Stop reporting AUC on deterministic 0/1 paths.
- Remove "sanity floor" framing for Path A (understated the violation).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* review: address Copilot inline comments on PR 1.1
Five real issues from the automated review:
1. Path A (label cleartext read) used .astype(bool), which maps NaN to
True. Routed through pandas' nullable "boolean" dtype with explicit
fillna(False) so partially-redacted bundles don't report phantom
positives. Same fix applied to y_true construction in probe_bundle.
2. _binary_metrics F1 returned NaN when precision OR recall was exactly
0.0 — should be 0.0 (zero-skill is well-defined; NaN should signal
undefined). New logic: NaN only when precision/recall is itself NaN;
0.0 when both are zero; harmonic mean otherwise.
3. Module docstring promised exit 1 on "shape errors" but main() only
caught FileNotFoundError. Broadened catch to (FileNotFoundError,
KeyError, ValueError); error message now includes exception type.
4. probe_bundle() previously bailed with an "error" key when
converted_within_90_days was absent. Post-Phase-2 bundles will be
exactly that shape. Now reports path_prediction_rates (fraction of
leads each path flags positive) unconditionally — useful as a
sanity view both before and after the fix. The early-return is now
a "note" key rather than "error", with a message pointing the
reader at the prediction-rates view.
5. New tests:
- F1 zero-precision-and-recall yields zero, not NaN
- Path A with NaN-bearing label column maps NaN -> False
Total: 10 tests, all passing. Lint/format/mypy clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
docs/external_review/summaries/.docs/release/.CLAUDE.mdforbidding public relational tables that reconstruct the label via joins..agent-plan.mdwith the new "Next Up —leadforge-lead-scoring-v1Curated Dataset Release" section, alpha tracker preserved beneath as historical context.Critical finding from review
ChatGPT v2 verified locally that
student_publicbundles reconstructconverted_within_90_daysend-to-end through public relational tables:opportunities.close_outcome == "closed_won"pluscustomers/subscriptionsexistence (which only exist for converted leads) recover the target with 100% accuracy via joins. This is THE blocker for any public Kaggle/HF release. Phase 2 of the v1 roadmap is the structural fix.Roadmap shape (7 phases)
leadforge-lead-scoring-v1leadforge/render/relational_snapshot_safe.py+leadforge/validation/relational_leakage.py; bundle schema v4 → v5release_quality.py+leakage_probes.py+reporting.py; calibration / lift / cross-seed bandsdataset-metadata.json+ HFREADME.mdwith configs + cover imageEffective Semantic Diversityrubric dimensionRecommendations triage
22 review findings processed in
docs/external_review/summaries/recommendations_pass.mdwith explicit action codes (accept/accept-with-different-approach/defer/out-of-scope-issue). 14 items go to v1; 8 to post-v1 (after-release-roadmap); 2 to v2-track issues. The split is documented item-by-item with rationale.What's NOT in this PR
Planning artifacts only — no code changes. Implementation begins in follow-up PRs (Phase 1 first).
Doc inventory
Test plan
recommendations_pass.mdand confirms triage matches intentv1_release_roadmap.mdPhase 2 design and confirms snapshot-safe export shape is correctv1_acceptance_gates.mdand confirms the gate set is the right v1-ready definition🤖 Generated with Claude Code